An Efficient Framework to Extract Parallel Units from Comparable Data
Abstract
Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we borrow the idea of phrase-table acquisition in SMT to extract parallel fragments: a novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on that word alignment. We integrate the extraction tasks at both levels into a unified framework. Experimental results show that a baseline SMT system improves significantly when this additionally mined knowledge is added.
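To make the sub-sentential step more concrete, the following is a minimal sketch of the standard consistent-phrase-pair extraction used when building an SMT phrase table, applied to a single word-aligned sentence pair. It is not the authors' alignment model or their extraction procedure: the function name `extract_fragments`, the `max_len` parameter, and the toy alignment are all assumptions made for illustration, and the alignment is simply taken as given.

```python
def extract_fragments(src, tgt, alignment, max_len=4):
    """Return (source fragment, target fragment) pairs consistent with the alignment.

    src, tgt   : lists of tokens
    alignment  : set of (source index, target index) links (assumed given)
    max_len    : hypothetical cap on fragment length, in tokens
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue  # skip spans with no aligned word
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue  # target fragment would be too long
            # Consistency check: no target word inside [j1, j2] may be
            # linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example with a hand-written alignment (illustrative only).
src = ["the", "black", "cat"]
tgt = ["le", "chat", "noir"]
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_fragments(src, tgt, alignment)
```

In this toy case the span "the black" is rejected because its target range would include "chat", which is linked to "cat" outside the span. On comparable rather than parallel sentences, many words have no counterpart at all, which is presumably why the abstract calls for an alignment model designed for such pairs.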
Publication date: 2013